This report summarizes the research performed under AFOSR grant 91-0403. Section 1
presents a virtual checkpointing scheme for recovery. Section 2 presents schemes
for implementing reliable memory. Roll-forward recovery schemes for duplex systems are
discussed in Section 3. Section 4 discusses REACT, a tool for reliable architecture
characterization, and its application to reliability evaluation of TMR systems. Section 5
discusses a new approach for low-cost system level diagnosis. Section 6 presents results
on the reliability-safety trade-off in modular redundant systems.

1  Virtual  Checkpointing 

Virtual checkpointing combines concepts from two database recovery techniques, shadow
paging and twin paging, to support checkpoint and rollback recovery in the virtual memory
translation hardware [1, 5, 6]. The concept of supporting the active data is implemented by
dynamically allowing a second copy of the virtual page. The active pages can be identified
by the use of a checkpoint counter associated with each page. In addition to detecting active
pages in a rollback situation, the counters also allow the checkpoint processing to be deferred
past the exact instant of the checkpoint (assuming a fault tolerant memory).

The technique supports two classes of data within the virtual memory system (i.e.,
active and checkpoint). Each class still supports the traditional two level store of virtual
memory (i.e., real memory and paging disk). Similar to the other schemes, virtual
checkpoints must be able to detect all active pages and make all active pages permanent at the
checkpoint time. This is achieved by having a global checkpoint counter (V) covering all the
data and local checkpoint counters (v) for individual pages. Essentially, the global
checkpoint counter is copied to the local counter on every reference (note: this is the logical
description and does not actually occur on every reference). Thus, the active data are the pages
with the local counter v equal to the global counter V. The global counter V is incremented
when a checkpoint is taken. Thus, all active pages become checkpoint versions when the
global counter V is incremented. Figure 1 illustrates the basic concepts. Virtual page k
has not been referenced since the prior checkpoint. Page j has been accessed in the current
interval and has both an active and checkpoint version. Note that in all cases the mapping
refers to both a real storage frame and a disk slot.

Figure 1: Virtual Checkpointing: Basic concept

The mappings for each virtual page are replicated and are referred to as m0 and
m1. A mapping m_i contains mappings for the real frame (r_i) and the disk copy (d_i). Each
virtual page has a one bit field, l, which can be thought of as a switch that points to the
most recently used mapping (i.e., m0 or m1). Thus, the notation m_l refers to the mapping
that was used last. In addition, each page has a k bit local checkpoint number (v) which
contains a copy of the global checkpoint number (V) during the most recent reference. The
checkpoint number, V, is a global value which is incremented on every checkpoint (the scope
of V determines the scope of the checkpoint, e.g., the entire system, a single address space
or portions of an address space).

An important aspect of the scheme is that the actions of taking a checkpoint are not
concentrated at the actual time of the checkpoint but rather are distributed over the time
following the checkpoint. This is because, under the assumption of fault tolerant memory, the
only action required to perform a checkpoint is to increment the global checkpoint counter
V. The processing for the individual pages is deferred until the first reference to the page
after the checkpoint. In order to determine whether the deferred processing must occur,
the values V and v must be compared on every reference (using the translation look-aside
buffer with the scheme avoids having to actually make this comparison on every reference).

Thus, when a page is referenced it is either the case where the checkpoint processing must
occur (v ≠ V) or a normal access to the active page (v = V). Figure 2 shows a situation
where checkpoints were taken at times t_c1 and t_c2. Consider the events at time t_R2. The
active page addressed by m1 (l = 1) was last referenced at time t_R1 (thus v = 1). The
reference at time t_R2 is the first reference after the checkpoint (because v ≠ V) and the
contents of page m1 must be preserved as the checkpoint page. Furthermore, the contents
of page m1 must be used while the resources of the old checkpoint page m0 (whose contents
are no longer required) are reused. Once the valid data has been copied to m0, the l-bit is
inverted (to l = 0) so that m0 becomes the active and m1 becomes the checkpoint. Finally,
the global checkpoint number V is copied to the local checkpoint number for this page so
that on the next access in the checkpoint interval a normal translation occurs. Figure 3
shows the situation at the next reference in this checkpoint interval. A reference at time t_R3
proceeds normally to the active data at m0 because V matches v.

Figure 2: Case 1 - first reference after checkpoint. (Timeline: checkpoints at t_c1 and
t_c2 separate the intervals V = 0, V = 1 and V = 2; m0 holds the checkpoint page and m1 the
active page, with v = 1.)

A rollback requires discarding any data that has been modified since the prior
checkpoint. If the page has not yet been referenced since the prior checkpoint then the page is
essentially in a rolled back state and nothing needs to be done (e.g., Case 1 in Figure 2).
If the page has been referenced since the prior checkpoint then there is an active page that
must be discarded. For example, if a failure occurs at time t_R3 in Figure 3, one wants to
discard m0 and restore m1. So for all pages with V = v, the v value is decremented and the
l bit is inverted. This forces the state to be like Figure 2 where m1 contains the checkpoint
and m0 contains useless information.

Figure 3: Case 2 - page previously referenced.
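The counter and switch-bit bookkeeping described in this section can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the hardware mechanism: the class and method names are hypothetical, page contents stand in for the (real frame, disk slot) mappings, V starts at 1 with every page at v = 0 so that the initial contents act as the checkpoint version, and wrap-around of the k bit counter is ignored.

```python
class Page:
    """One virtual page with two replicated mappings, m0 and m1."""

    def __init__(self, contents):
        self.m = [contents, None]  # m[0], m[1]: stand-ins for the (frame, disk) mappings
        self.l = 0                 # one-bit switch: index of the most recently used mapping
        self.v = 0                 # local checkpoint number


class VirtualCheckpointMemory:
    def __init__(self, initial):
        # Assumption: V starts at 1 and every page at v = 0, so the initial
        # contents count as the checkpoint version of each page.
        self.V = 1
        self.pages = {name: Page(data) for name, data in initial.items()}

    def _reference(self, name):
        page = self.pages[name]
        if page.v != self.V:
            # First reference since the checkpoint: m_l becomes the checkpoint
            # page and the old checkpoint slot is reused for the active copy.
            other = page.l ^ 1
            page.m[other] = page.m[page.l]
            page.l = other
            page.v = self.V
        return page

    def read(self, name):
        page = self._reference(name)
        return page.m[page.l]

    def write(self, name, data):
        page = self._reference(name)
        page.m[page.l] = data

    def checkpoint(self):
        self.V += 1  # the only action needed at checkpoint time

    def rollback(self):
        for page in self.pages.values():
            if page.v == self.V:  # page has an active version to discard
                page.v -= 1
                page.l ^= 1
```

In this sketch a page that was never referenced in the current interval is left untouched by rollback, which mirrors the observation that such a page is already in a rolled back state.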

2  Reliable  Memory  Design 

The use of a hybrid memory structure consisting of both highly reliable and normal memory
can further support persistent and recoverable memory [1]. Hybrid algorithms that
manage the writable memory and read-only memory separately are proposed. The traditional
measures of virtual memory algorithms (i.e., lifetime and space-time) have been extended to
account for the dual nature of the policies. Several properties of the policies have been
explored. It has been shown that the knee of a hybrid lifetime curve produces a near minimum
space-time product, as with the existing algorithms. Hybrid policies are more controllable
with respect to highly reliable memory because they can constrain the amount of writable
memory and gain performance by using additional read-only memory. The lifetime
measure for the hybrid policies under constrained writable memory, when compared at equal
amounts of highly reliable memory, is better than the single policy algorithm at a small
cost of additional read-only memory. Furthermore, even at an unconstrained amount of
writable memory, the hybrid policy produces approximately equal performance while the
writable memory can be completely fixed in size. Theoretical results are also derived for a
property which indicates the optimal performance for a hybrid reference stream based on
two individual streams.

The ability to accurately predict the reliability of a system is very important. Two
novel techniques have been developed which focus on dynamic aspects of memory [2, 3, 4, 7].
The first focuses on the memory reference patterns of a particular program while the second
looks at memory behavior due to memory management actions.

The  first  novel  technique  evaluates  the  probability  of  correct  execution  of  a  program 
based  on  the  program’s  memory  access  behavior.  The  approach  is  an  analytical  study  using 
an  existing  model  which  characterizes  an  address  trace  with  four  parameters.  Three  cases  are 
developed based on the storage allocation policy (i.e., pre-allocated, dynamically allocated,
or  constrained  in  allocation).  The  models  are  able  to  compare  the  traditional  view  that  is 
taken  in  standard  memory  reliability  analysis  to  that  of  a  real  world  environment  where  a 
program  uses  a  varying  fraction  of  the  memory  at  different  instances.  Using  these  models,  it 
is  shown  that  the  reliability  may  be  significantly  better  than  the  apparent  reliability  when 
the  program  behavior  was  not  considered.  It  provides  one  explanation  for  the  cause  of 
unobserved  faults  along  with  an  analytical  basis  for  determining  the  extent  of  faults  not 
being  observed.  Possibly  the  most  important  application  of  these  models  is  to  analytically 
quantify  the  observed  phenomenon  that  failure  rates  increase  with  increased  workload.  A 
new  explanation  has  been  proposed  for  this  phenomenon  based  on  the  notion  that  programs 
often  have  storage  allocated  which  will  never  be  referenced  again  and  cannot  cause  a  failure. 
Assuming  a  constant  fault  rate  over  increased  workloads,  the  model  shows  that  there  could 
be  a  significant  increase  in  observed  failures.  The  model  was  validated  with  actual  program 
traces  and  shown  to  be  very  accurate.  Finally,  several  techniques  have  been  shown  for 
extracting  the  fractal  parameters  of  a  program  trace. 

The second novel technique for reliability analysis uses the memory space allocated
to more accurately calculate the reliability. This can be used to understand the relationship
between the amount of memory allocated and the reliability. This effect has been quantified
based on the relative cost of a fault. Distinct effects have been measured depending on
the relative speed of the paging device. For small reload times it is found that a decrease
in the memory partition size leads to an increase in reliability at the cost of additional
instruction overhead. For extremely long reload times it is found that larger amounts of
memory lead to increased reliability. There also exists a middle reload time where the
optimal reliability corresponds to the optimal space-time performance. Other aspects of
virtual memory algorithms such as small pages and different paging algorithms were studied.
Furthermore, the methodology was applied to study the reliability of cache memories which
have the characteristic of very small reload delays. The results show that the reliability
improvement factor can change by several orders of magnitude based on the cache size. For
small memory sizes it was found that a very small number of page durations contribute to
a majority of the total unreliability. Two techniques have been suggested to remove these
long durations, which then lead to even greater improvements in the reliability. One is
an algorithm called selective scrubbing to break the long durations, which could be
implemented in either software or hardware. A second technique showed that the addition of very
small amounts of highly reliable memory can also lead to significant reliability improvements.

3  Roll-forward  Checkpointing  Schemes 


A fault-tolerant multiprocessor environment wherein each task is executed simultaneously
on two processing modules is considered. A pool of a small number of nondedicated spares
or processing modules with spare processing capacity is assumed available (see Figure 4).
Duplex fault-tolerant architectures that require no rollback for most faults are proposed.


Figure 4: System architecture for roll-forward checkpointing schemes (figure labels:
Spare; VS: Volatile Storage)

In the proposed schemes, at each checkpoint the state of the two modules executing
the task is compared for detection of faults. If a fault is detected, instead of the usual rollback,
the following mechanism is used for identification of the faulty processing module [13, 14, 17].
The good state of the previous checkpoint is loaded into a spare module. The checkpoint
interval in which the failure is detected is then “retried” on the spare module. Concurrently,
the task continues execution on both processing modules in the duplex system. At the next
checkpoint the state of the spare is compared with the state of the two processing modules
at the previous checkpoint where disagreement occurred. This allows for the identification
of the faulty module (see Figure 5). Once the faulty module is identified, the state of the
faulty module is made consistent with the state of the fault-free module in the duplex system
and the spare is released to the pool.


1: Copy state to the spare
2: Compare state of the spare with the state of A and B
3: Copy state from A to B
X: A fault

Figure 5: Roll forward checkpointing scheme

These schemes are termed Roll-Forward Checkpointing Schemes (RFCS). The
proposed RFCS schemes provide a mechanism for identifying the faulty processing module
and recovering it, in most cases, without the overhead of rollback. It is demonstrated that the
proposed schemes have potential performance advantages over conventional duplex systems
with rollback.

Specifically, the advantage of the proposed schemes is that they achieve a lower
average execution time with a lower variance as compared to the rollback schemes. This is crucial
for real-time systems with hard deadlines as lower variance enhances the predictability of
the task completion time.
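The fault identification step of the roll-forward scheme can be illustrated with a small Python sketch. It is a simplified model under assumptions that are not in the report: module state is a single integer, `run_interval` is a hypothetical stand-in for executing one checkpoint interval, the spare is assumed fault-free during the retry, and the concurrent forward execution of A and B is not modeled.

```python
def run_interval(state, faulty=False):
    """Hypothetical task step: advance the state by one checkpoint interval."""
    new_state = state + 1
    # A fault is modeled as an arbitrary corruption of the resulting state.
    return new_state + 1000 if faulty else new_state


def identify_faulty(prev_checkpoint, state_a, state_b):
    """Retry the interval on a spare and compare its state with A and B."""
    if state_a == state_b:
        return None  # states agree at the checkpoint: no fault detected
    # Step 1: load the good previous checkpoint into the spare and retry.
    spare_state = run_interval(prev_checkpoint)
    # Step 2: the module that disagrees with the spare is the faulty one.
    if spare_state == state_a:
        return "B"
    if spare_state == state_b:
        return "A"
    return "unknown"  # spare matches neither module: fall back to rollback
```

Once the faulty module is named, step 3 of Figure 5 copies the state of the fault-free module onto it; in this sketch that would be a plain assignment.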

4 Synthesis and Evaluation of Alternative Fault-Tolerant Architectures

Another direction of our research was the study of alternative fault-tolerant computer
systems. Our continuing goal is to synthesize and evaluate novel architectures which offer
increased performance and/or require less hardware than traditional designs while
providing nearly the same dependability. We are specifically interested in the class of architectures
which can be represented by the generalized system model pictured in Figure 6. This
multiprocessor abstraction consists of multiple, possibly redundant, processor (P) and memory
(M) modules interconnected through some form of error control logic (such as voters,
comparators, switches or error correcting codes). A wide variety of highly dependable
architectures fit this model: static, dynamic and hybrid redundancy, systems with coding, plus many
non-fault-tolerant multiprocessors.

Reliability and availability are the metrics used to judge the efficacy of the
performance/redundancy tradeoffs being investigated. Many hardware and software attributes
influence the dependability of a system, including specific fault characteristics, error
containment ability and variations in workload. We are particularly concerned with the effect
program behavior has on reliability. Detailed system models which account for these factors
are often very difficult to formulate through analytical techniques (such as combinatorial
and Markov models) which are commonly used for dependability assessment. In order to
facilitate our research, we have developed a simulated fault-injection testbed called REACT
to experimentally analyze the dependability of these new computer architectures.

Figure 6: Generalized System Model

4.1  The  Reliable  Architecture  Characterization  Tool 

The Reliable Architecture Characterization Tool (REACT) is a software testbed which
performs automated life testing of many user-defined multiprocessor systems through simulated
fault-injection [9, 32]. This involves emulating the high-level hardware and software
components of a given system while concurrently injecting bit-level faults and errors into it. During
a single simulation run, the code conducts a certain number of experiments or trials in which
an initially fault-free system is operated until it fails or reaches a specified censoring time.
The exact number of trials required is determined by the desired confidence intervals about
the system dependability attribute being investigated. Extensive instrumentation has been
included in the program in order to collect data from each trial which is later aggregated
over the entire simulation run in order to generate the outputs. Graphs of reliability and
availability, a comprehensive failure mode report and various statistical measurements are
provided as output by the software. REACT consists of 8000 lines of C running under UNIX
and completes a “typical” simulation run in less than 10 hours on a dedicated DECstation
5000/120.
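How a desired confidence interval can fix the number of trials is sketched below in Python. The report does not say which stopping rule REACT uses, so this is an assumed, generic rule: trials continue until the normal-approximation confidence interval around the estimated survival probability is narrower than a target half width. The function name, its parameters and the `run_trial` callback (returning True when the simulated system survives to the censoring time) are all hypothetical.

```python
import math


def estimate_reliability(run_trial, half_width=0.01, z=1.96,
                         min_trials=100, max_trials=1_000_000):
    """Run trials until the confidence interval half width is small enough.

    run_trial() must return True if the system survived to the censoring
    time and False if it failed earlier. z = 1.96 corresponds to roughly
    95% confidence under the normal approximation.
    """
    survived = 0
    n = 0
    while n < max_trials:
        n += 1
        if run_trial():
            survived += 1
        if n >= min_trials:
            p = survived / n
            current_half_width = z * math.sqrt(p * (1.0 - p) / n)
            if current_half_width <= half_width:
                break
    return survived / n, n
```

The normal approximation is poor when the estimate is very close to 0 or 1, which is exactly the regime of ultrahigh reliability systems; a real study would likely need a more careful interval, so the sketch only conveys the idea of a precision-driven trial count.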

REACT can analyze the class of architectures which was shown previously in Figure 6.
Any number of processor and memory modules may be specified and each can be designated
as initially active or a hot or cold standby spare. Groups of processors or memories may also
be defined in which all modules operate redundantly. The error control logic may be built
from various combinations of components commonly found in fault-tolerant designs. Custom
error control logic circuitry may also be specified by the user. Processors are simulated at
the functional level whereas a logical-level description is used for the memory modules and
error control logic. Logic values 0 and 1 are not differentiated in the system model: only
error-free and erroneous states exist for each bit. Memory depth is variable and a 16-bit
word width for memory and all data paths has currently been implemented. Other word
sizes may be realized with minor modifications in the code.

A synthetic workload is assumed in which processors continually perform computation
cycles consisting of an instruction fetch, a possible operand read, a computation and a
possible result write. Real code and data are not used by REACT, but errors are allowed
to propagate throughout the system as if the application program was actually being
executed. Behavior of the application workload is specified by a mean instruction execution
rate, the probabilities of performing a data read and write per instruction, plus a locality of
reference model. Values for the mean number of data accesses made during the execution of
an instruction may be obtained either through trace analysis or directly from the
measurement of operational hardware. It is assumed that all memory references access one whole
word. Which memory locations are accessed during a computation cycle are determined via
the locality of reference model. The testbed implements a model based on Bradford-Zipf
distributions which suggests that α% of all accesses go to β% of the memory under the
condition α + β = 1. Reference addresses are assumed to be uniformly distributed inside
and outside of the locality and no attempt is made to separate code from data in memory
with the model.
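The locality of reference model can be illustrated with a short Python sketch: a fraction α of references falls uniformly inside a locality covering a fraction β = 1 - α of memory, and the rest fall uniformly outside it. Placing the locality at the lowest addresses, and the function and parameter names, are assumptions made only for this example.

```python
import random


def make_address_generator(memory_words, alpha=0.8, seed=None):
    """Return a function that draws word addresses under the alpha/beta rule."""
    beta = 1.0 - alpha
    # Assumption: the locality occupies the first beta fraction of memory.
    locality_size = max(1, round(beta * memory_words))
    rng = random.Random(seed)

    def next_address():
        inside = rng.random() < alpha
        if inside or locality_size >= memory_words:
            return rng.randrange(locality_size)               # uniform inside
        return rng.randrange(locality_size, memory_words)     # uniform outside

    return next_address
```

With `alpha=0.8` this gives the familiar case in which about 80% of the accesses go to 20% of the memory.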

The fault/error model employed by REACT accounts for permanent, intermittent
and transient faults in the processors plus permanent and transient faults in the memories
as well as the error control logic. Faults with a Weibull distribution (of which the exponential
distribution is a subset) for their inter-arrival times are injected into these modules only at
the beginning of a computation cycle. Faults are assumed to always cause immediate errors,
so their fault (but not error) latency is 0. Correlated failures are presently not considered.
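Drawing fault arrival times with Weibull distributed inter-arrival times can be sketched as follows. This is a minimal illustration, not REACT's code: the function name and parameters are hypothetical, and the rounding of each arrival to the start of a computation cycle is left out. A shape parameter of 1 reduces the Weibull distribution to the exponential case mentioned above.

```python
import random


def fault_arrival_times(scale, shape, censoring_time, seed=None):
    """Return the fault arrival times that occur before the censoring time."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    while True:
        # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
        t += rng.weibullvariate(scale, shape)
        if t > censoring_time:
            return times
        times.append(t)
```

A simulator would then inject each fault at the first computation cycle that starts at or after its arrival time.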




Processor  fault  effects  are  assumed  to  be  completely  characterized  by  the  rate  at 
which  errors  appear  on  its  memory  bus.  Three  types  of  errors  exist:  transients  lasting  only 
one  computation  cycle,  intermittents  with  a  Weibull  distributed  duration  and  permanents 
which  have  an  effect  in  every  computation  cycle.  Errors  may  affect  either  addresses,  (write) 
data  or  both  addresses  and  data  simultaneously.  An  erroneous  address  is  assumed  to  access 
a  random  memory  location  while  erroneous  data  take  a  random  value.  In  addition,  erroneous 
processor  reads  generate  output  errors  in  the  same  computation  cycle. 

Memory faults are divided among the bit-array and addressing-logic regions of a memory
module. The fraction of faults which fall into each of these regions may be approximated
by  their  relative  chip  areas.  Bit-array  faults  are  assumed  to  affect  a  single  random  bit  in  a 
word  at  a  random  address  while  a  random  location  is  referenced  during  an  addressing-logic 
fault.  A  transient  bit-array  fault  may  be  overwritten  (changing  it  from  the  erroneous  to 
error-free  state)  at  any  time,  but  a  permanent  can  never  be  overwritten.  Addressing-logic 
transients  last  one  computation  cycle  and  permanents  will  cause  the  memory  module  to 
endlessly  access  random  words.  An  access  to  a  random  address  reads  or  writes  a  value  with 
randomly  corrupted  bits,  representing  the  difference  between  the  bit  values  of  the  word  that 
was  accessed  and  the  word  that  should  have  been  accessed.  Finally,  faults  within  one  of  the 
error  control  logic  components  are  assumed  to  affect  a  single  random  bit  either  permanently 
or  for  one  computation  cycle  in  the  case  of  transients. 

4.2  Reliability  Analysis  of  Unidirectional  Voting  TMR  Systems 

Computer systems used in aircraft and reactor control often require critically high reliability
for moderately short mission times. Triple modular redundant (TMR) hardware has been
employed in many of these ultrahigh reliability applications. The three redundant processors
of a TMR system concurrently execute identical tasks while the triplicated memories contain
the same code and data. Majority voting is used to mask erroneous module outputs. As seen
in Figure 7, the voter (V) is usually inserted into the redundant system buses between the
processors (P) and memories (M). Bit-wise voting is typically performed on data, address
and control lines during both read and write accesses to memory. Such a system will be
referred to as bidirectional voting (BDV) TMR.


Figure  7:  TMR  System  with  Bidirectional  Voting 
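Bit-wise majority voting over three redundant words can be written compactly. The Python sketch below is a generic illustration of the voter function, not REACT code (REACT tracks only error-free and erroneous bit states rather than logic values); the 16-bit mask reflects the word width mentioned earlier and the names are chosen for this example.

```python
WORD_MASK = 0xFFFF  # 16-bit words


def vote(a, b, c):
    """Return the bit-wise majority of three words.

    Each output bit takes the value held by at least two of the inputs,
    so an error confined to a single module is masked.
    """
    return ((a & b) | (a & c) | (b & c)) & WORD_MASK
```

For example, `vote(0x1234, 0x1234, 0xFFFF)` yields `0x1234`: the word from the single disagreeing module is outvoted.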

Voting  has  a  substantial  performance  penalty  associated  with  it.  This  degradation 
can be attributed to two specific delays [8]. The propagation delay of signals through the
voter  logic  is  the  more  obvious  contributor  to  increased  memory  access  times.  Less  apparent 
is  the  synchronization  delay  incurred  when  clock  skew  requires  modules  to  wait  for  a 
lagging  signal  before  performing  a  vote.  This  penalty  becomes  even  greater  if  a  module  fails 
in  such  a  way  that  it  does  not  respond,  forcing  a  timeout  period  to  be  suffered  on  each 
memory  reference.  TMR  systems  used  in  hard  real-time  applications  may  not  be  able  to 
tolerate  the  ensuing  drop  in  throughput  after  this  type  of  failure. 

It  is  possible  to  significantly  reduce  the  performance  degradation  of  a  BDV  system  by 
voting  only  on  one  type  of  memory  access,  either  reads  or  writes.  These  unidirectional 
voting  systems  are  expected  to  have  lower  reliability  than  the  bidirectional  design  since  a 
smaller  fraction  of  errors  will  be  masked,  possibly  allowing  them  to  propagate  and  corrupt 
the state of non-faulty modules.

Because the voter may be by-passed on either memory read or write accesses to
achieve higher performance, two different unidirectional voting systems exist. The
Read-Only Voting (ROV) TMR system removes the voting delays from the bus cycle on writes.
Processor generated read addresses and memory outputs are voted upon and a single voted
value is distributed to all three processors. Processor outputs are written straight into the
corresponding memories without any error masking. The ROV TMR system therefore allows
processor errors to propagate into the memories while all single errors from memory will be
contained by the voter.

The dual of the ROV system is the Write-Only Voting (WOV) TMR system which
eliminates  the  delay  associated  with  voting  on  read  accesses.  It  performs  a  vote  only  at  the 
outputs  of  the  processors  and  writes  a  single  voted  value  into  all  three  memories  at  a  voted 
address.  No  masking  of  data  or  addressing  errors  takes  place  during  reads,  so  erroneous 
memory  outputs  may  propagate  directly  into  the  associated  processors.  Voting  terminates 
any  single  processor  error  before  it  reaches  the  memories. 

Both unidirectional voting TMR systems can realize better performance than the
traditional bidirectional voting system. However, WOV should have better performance
than ROV because it suffers the delays of voting less often, since reads generally occur
much more frequently than writes. In terms of fault-tolerance, one might expect ROV to
provide higher reliability than WOV for similar reasons. When processors and
memories experience faults at the same rate, the percentage of potentially fatal errors that will
get masked will be larger with the ROV system. In addition, memory often has a higher
fault rate than processors so the percentage of errors masked will be even greater when the
voter is placed at the output of the less reliable component.

Two parametric analyses of the bidirectional and unidirectional voting TMR systems
were carried out with REACT [10, 11]. Figure 8 shows a typical reliability plot from this
investigation. The following observations were made:

•  the  tradeoff  of  reliability  for  performance  made  by  the  unidirectional  voting  systems 
becomes  more  effective  as  the  difference  between  processor  and  memory  module  failure 
rates  increases 

•  near  ideal  tradeoffs  can  be  attained  for  some  failure  rate  combinations,  particularly 
when  memory  is  more  likely  to  fail  than  the  processors 




•  the  analytical  model  traditionally  used  to  predict  the  reliability  of  TMR  designs  is 
indicative  of  some  of  the  differences  between  the  bidirectional  and  unidirectional  voting 
systems,  but  is  not  always  accurate 

•  reliability  of  the  ROV  system  is  generally  better  than  the  WOV  system,  except  when 
processor  failure  rates  are  high  relative  to  the  memory  failure  rates 

•  system  failure  is  caused  by  propagation  of  errors  more  often  in  the  WOV  system  than 
in  the  ROV  system 

•  workload  has  limited  effects  on  reliability  when  memory  error  latency  is  low 

Results  demonstrated  that  in  many  cases,  acceptably  little  reliability  was  sacrificed  by  the 
unidirectional  voting  TMR  systems  for  a  potentially  large  increase  in  performance. 

Figure 8: Example Reliability Plot from Analysis with REACT

5 Safe System Level Diagnosis

System level diagnosis has, until now, focused on location of faulty nodes in a system.
A novel low-cost approach termed safe diagnosis has been developed. Diagnosing a large
number of faulty nodes requires a large number of diagnostic tests. The proposed diagnosis
approach alleviates the high cost of system level diagnosis by reducing the number of tests
carried out periodically. By combining fault location with fault detection, this approach
achieves high levels of diagnostic safety and recoverability. Systems which can guarantee
correct diagnosis of up to t faults, and fault detection up to u faults, u > t, have been
analyzed [15].

Systems that can perform safe system level diagnosis in the presence of permanent
as well as intermittent faults have been characterized. The complexity of safe diagnosis
algorithms is shown to be comparable with the diagnosis algorithms for systems performing
only fault location. When only permanent faults are present, achieving a large fault detection
capability in addition to an existing fault location capability requires only minimal additional
test overhead. The testing overhead for intermittent fault detection is larger compared to
permanent fault detection.

An  adaptive  diagnosis  algorithm  that  performs  fault  location  and  detection  is  pro¬ 
posed  for  the  permanent  fault  case.  Compared  to  any  adaptive  algorithm  for  pure  fault 
location,  our  algorithm  requires  just  one  additional  test  in  the  worst  case.  A  distributed 
algorithm  is  also  proposed  for  safe  diagnosis  of  distributed  multiprocessor  systems.  Repair 
of  distributed  systems  requires  that  an  external  user  be  able  to  decide  the  status  of  all  the 
system nodes or detect a fault situation beyond the fault location capability of the system.
Algorithms for such user diagnosis of a distributed system have been developed.

The concept of safe diagnosis can be used for adaptive t-diagnosis on any t-diagnosable
system, not necessarily one with the traditionally used fully connected testing graph. An
adaptive algorithm for t-diagnosis on t-diagnosable testing graphs has been designed.

From the results obtained under the AFOSR grant, it is clear that the safe diagnosis
approach keeps the cost of diagnosis low. Thus, the proposed approach is of significant
interest from a practical viewpoint.
6 Safe Modular Redundant Systems

Dependability considerations warrant that, in addition to reliability, a dependable system
must have a high level of safety. Therefore, there is a need to ensure operation that is both
error-free under adverse conditions and safe under severely adverse conditions.

We have analyzed a technique for implementing systems requiring high reliability and
safety [16]. These systems, named n-Safe modular redundant (nSMR) systems, achieve high
reliability and safety using module replication and redundancy in module output.

An  nSMR  system  consists  of  n  identical  modules  and  an  arbiter.  The  arbiter  uses 
outputs  of  all  the  n  modules  to  decide  the  nSMR  system  output.  Reliability  and  safety  of 
the system are a function of the arbitration strategy used. When reliability is the only
criterion, an optimal arbitration strategy that maximizes the reliability can be designed. With
reliability and safety both of concern, usually no single arbitration strategy is optimal. We
have presented an implementation of maximal arbitration strategies which achieve different
maximal reliability and safety combinations. An arbitration strategy is maximal if no other
arbitration strategy has both better reliability and better safety.
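One simple way to picture an arbiter is as a threshold rule over the module outputs. The sketch below is illustrative only and is not the implementation from [16]. It assumes that a module with output redundancy can flag its own output as erroneous (modeled here as `None`) and that the arbiter may withhold an output by emitting a safe signal. The names `SAFE` and `threshold_arbiter` are hypothetical.

```python
from collections import Counter

SAFE = "SAFE"  # hypothetical marker: the arbiter withholds an output

def threshold_arbiter(outputs, threshold):
    """Return the plurality value if at least `threshold` modules agree
    on it and it is the unique leader; otherwise return SAFE.

    outputs: one entry per module; None means the module flagged its
    own output as erroneous (an assumed form of output redundancy).
    """
    votes = Counter(o for o in outputs if o is not None)
    if not votes:
        return SAFE
    ranked = votes.most_common(2)
    value, count = ranked[0]
    tie = len(ranked) > 1 and ranked[1][1] == count
    if count >= threshold and not tie:
        return value
    return SAFE

print(threshold_arbiter([1, 1, 0], threshold=2))     # two of three agree
print(threshold_arbiter([1, 0, None], threshold=2))  # no value reaches 2
```

Under this model, raising the threshold makes the arbiter withhold an output more often, which lowers reliability but reduces the chance of delivering an incorrect result.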

The  effect  of  increasing  redundancy  on  the  achievable  reliability  and  safety  has  been 
analyzed  for  systems  with  and  without  redundant  module  outputs.  Detailed  results  on 
binary SMR systems using binary arbiters have also been obtained. The results are
summarized below:

• It is shown that for modules without output redundancy, no arbitration strategy exists
for (n + 1)SMR which achieves better reliability and safety compared to certain
arbitration strategies for nSMR; further, given any arbitration strategy for nSMR, there
always exists an arbitration strategy for (n + 2)SMR that achieves higher reliability
and safety.

• It is shown that if modules have output redundancy, given an arbitration strategy for
nSMR, one can always find an arbitration strategy for (n + 1)SMR that achieves better
reliability and safety.

• A detailed analysis of binary nSMR systems with single bit output is presented.
Whether binary (n + 1)SMR dominates binary nSMR is shown to be dependent on
the relation between the likelihood of a detected error (p_d) and the likelihood of an
undetected error (p_u) in a binary module's output. It is shown that when p_d = p_u,
binary (n + 1)SMR does not dominate any of the plurality strategies for binary nSMR.
Also, exact expressions for the reliability and safety of the maximal strategies for such
systems have been presented.

• Design of a family of threshold-based maximal arbitration strategies which achieve
different reliability and safety is presented. Design of a class of arbitration strategies
easier to implement as compared to the threshold-based arbitration strategies is
also presented. These arbitration strategies are obtained by generalizing the plurality
strategies.
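
The reliability-safety trade-off among threshold strategies can be made concrete with a small calculation. The sketch below does not reproduce the exact expressions of [16]. It assumes independent binary modules, each producing a correct bit with probability 1 - p_d - p_u, a detected (flagged) error with probability p_d, and an undetected wrong bit with probability p_u. It also assumes an arbiter that outputs a bit only when at least k > n/2 unflagged modules agree on it, and it takes reliability as the probability of a correct output and safety as the probability that the output is either correct or withheld. The function name is hypothetical.

```python
from math import comb

def reliability_and_safety(n, k, p_d, p_u):
    """Reliability and safety of an assumed threshold arbiter over n
    independent binary modules (illustrative model, see lead-in).

    k must exceed n/2 so that the correct and the wrong bit cannot
    both reach the threshold.
    """
    if not n / 2 < k <= n:
        raise ValueError("threshold k must satisfy n/2 < k <= n")
    p_c = 1.0 - p_d - p_u
    reliability = 0.0
    unsafe = 0.0
    for c in range(n + 1):            # modules with a correct output
        for w in range(n - c + 1):    # modules with an undetected error
            d = n - c - w             # modules with a detected error
            prob = (comb(n, c) * comb(n - c, w)
                    * p_c**c * p_u**w * p_d**d)
            if c >= k:
                reliability += prob   # arbiter delivers the correct bit
            elif w >= k:
                unsafe += prob        # arbiter delivers the wrong bit
    return reliability, 1.0 - unsafe

for k in (2, 3):
    print(k, reliability_and_safety(n=3, k=k, p_d=0.0, p_u=0.1))
```

Under these assumptions, raising k can only lower reliability and can only raise safety, so each threshold yields a different reliability and safety combination rather than one strategy being best on both measures.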